Abstract

The technique of ridge regression, first proposed by Hoerl and
Kennard (1970), has become a popular tool for data analysts faced with
a high degree of multicollinearity in their data. By using a ridge
estimator, it was hoped that one could both stabilize the estimates
(lower the condition number of the design matrix) and improve upon the
squared error loss of the least squares estimator.

Recently, much attention has been focused on the latter objective.
Building on the work of Stein (1955) and others, Strawderman (1976) and
Thisted (1976) have developed classes of ridge regression estimators
which dominate the usual estimator in risk, and hence are minimax. The
unwieldy form of the risk function, however, has led these authors
to minimax conditions which are stronger than needed.

In  this  paper,  using  an  entirely  new  method  of  proof,  we  derive 
conditions  that  are  necessary  and  sufficient  for  minimaxity  of  a large 
class  of  ridge  regression  estimators.  The  conditions  derived  here  are 
very  similar  to  those  derived  for  minimaxity  of  some  Stein-type  estimators. 

We  also  show,  however,  that  if  one  forces  a ridge  regression  estimator 
to  satisfy  the  minimax  conditions,  it  is  quite  likely  that  the  other  goal 
of  Hoerl  and  Kennard  (stability  of  the  estimates)  cannot  be  realized. 
1.  Introduction

Beginning  with  the  work  of  Stein  (1955),  which  showed  that  in  higher 
dimensional  problems,  the  sample  mean  of  a multivariate  normal  distribution 
is  inadmissible  against  squared  error  loss,  much  research  has  been  aimed 
at  developing  estimators  whose  risk  functions  dominate  that  of  the  sample 
mean.  More  recently,  a new  estimation  procedure,  ridge  regression,  has 
been  developed  to  improve  upon  the  numerical  stability  of  the  least 
squares  estimator  in  linear  regression.  Although  the  original  purpose 
of  the  ridge  regression  estimator  was  not  to  dominate  the  risk  of  the 
least  squares  estimator,  recent  research  has  gone  in  that  direction. 

In  the  present  paper  we  develop  a class  of  ridge  regression  estimators 
and,  utilizing  a new  method  of  proof,  derive  necessary  and  sufficient 
conditions  for  these  estimators  to  be  minimax,  and  thus  dominate  the 
least  squares  estimator  in  risk.  We  also  point  out  that  "forcing"  ridge 
regression  estimators  to  be  minimax  makes  it  nearly  impossible  for  them 
to  provide  the  numerical  stability  for  which  they  were  originally 
intended. 

We  start  with  the  familiar  linear  model 

    $Y = Z\beta + \epsilon$,  (1.1)

where $Y$ is an $n \times 1$ vector of observations, $Z$ is the known $n \times p$ design matrix
of rank $p$, $\beta$ is the $p \times 1$ vector of unknown regression coefficients, and $\epsilon$ is
an $n \times 1$ vector of experimental errors. We assume that $\epsilon$ has a multivariate
normal distribution with mean vector zero and covariance matrix $\sigma^2 I_n$. ($I_n$
denotes the $n \times n$ identity matrix.)

The usual estimator of $\beta$ in (1.1) is the least squares estimator

    $\hat\beta = (Z'Z)^{-1} Z'Y$.  (1.2)

$\hat\beta$ minimizes the residual sum of squares of the regression, i.e.,

    $\min_{\beta} (Y - Z\beta)'(Y - Z\beta) = (Y - Z\hat\beta)'(Y - Z\hat\beta)$,  (1.3)

and thus is the estimate which best "fits" the data. Two different
lines of research, however, pointed out deficiencies in $\hat\beta$.

The first deficiency in $\hat\beta$ is its inadmissibility. If we measure
the loss of an estimator $\delta$ of $\beta$ by

    $L(\delta, \beta, \sigma^2) = \frac{1}{\sigma^2} (\delta - \beta)' Q (\delta - \beta)$,  (1.4)

where $Q$ is an arbitrary positive definite matrix, and let the risk of
$\delta$ be given by

    $R(\delta, \beta, \sigma^2) = E\, L(\delta, \beta, \sigma^2)$,  (1.5)

then the results of Brown (1966) show that $\hat\beta$ is inadmissible. Several
authors (e.g. Bhattacharya (1966), Berger (1976b)) have exhibited large
classes of estimators whose risk function dominates that of $\hat\beta$. Since $\hat\beta$
is a minimax estimator of $\beta$ with constant risk

    $R(\hat\beta, \beta, \sigma^2) = \mathrm{tr}\, Q (Z'Z)^{-1}$,  (1.6)

this search for estimators better than $\hat\beta$ is a search for minimax estimators.

A second deficiency in $\hat\beta$ was first noted by Hoerl and Kennard
(1970). If the matrix $Z$ arises from observation rather than from a
designed experiment, it is possible that there will be high correlation
among the $Z$ variables. This will lead to a $Z'Z$ matrix that is "nearly
singular", i.e. $Z'Z$ will have a wide eigenvalue spectrum. If this is the
case, Hoerl and Kennard point out that the least squares estimator $\hat\beta$ will
be "unstable" in the sense that a nearly singular $Z'Z$ will produce an
inverse with inflated diagonal values, and (see (1.2)) small changes in
the observations might produce large changes in $\hat\beta$. To correct this
problem, they proposed the ridge estimator

    $\hat\beta(k) = (Z'Z + k I_p)^{-1} Z'Y$,  (1.7)

where $k$ is a positive number. Adding the number $k$ before inverting amounts
to increasing each eigenvalue of $Z'Z$ by $k$. This can be made clear as
follows: Let $P$ be the orthogonal matrix whose rows are the orthonormal
eigenvectors of $Z'Z$, and let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ be its eigenvalues.
It follows that

    $P Z'Z P' = D_\lambda$,  $P P' = I_p$,  (1.8)

where $D_\lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. Then (1.7) can be written as

    $\hat\beta(k) = (P'(D_\lambda + k I_p)P)^{-1} Z'Y$.  (1.9)

To see how the ridge estimator is more stable than $\hat\beta$, we note that the
condition number of the matrix being inverted in (1.9) is decreased. The
condition number of a matrix is a measure of its ill-conditioning, given by

    $\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$,  (1.10)


where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the largest and smallest roots of
a matrix. Large values of $\kappa(A)$ mean that $A$ is ill conditioned. Since

    $\frac{\lambda_1 + k}{\lambda_p + k} \le \frac{\lambda_1}{\lambda_p}$  (1.11)

for $k > 0$, the ridge estimator is relieving the ill-conditioning problem
of $Z'Z$. A straightforward generalization of (1.9) is the generalized
ridge estimator

    $\hat\beta(K) = (P'(D_\lambda + K)P)^{-1} Z'Y$,  (1.12)

where $K = \mathrm{diag}(k_1, \ldots, k_p)$. Here, we allow each eigenvalue of $Z'Z$ to
be increased by a different amount.
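
The effect of adding $k$ on the condition number can be checked numerically. The following is a minimal sketch, assuming NumPy; the simulated design matrix, the value of $k$, and the helper names `condition_number` and `ridge_estimate` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def condition_number(a):
    """Ratio of largest to smallest eigenvalue of a symmetric positive definite matrix."""
    eig = np.linalg.eigvalsh(a)  # eigenvalues in ascending order
    return eig[-1] / eig[0]

def ridge_estimate(z, y, k):
    """Ridge estimator (Z'Z + k I_p)^{-1} Z'Y for a scalar k >= 0 (k = 0 gives least squares)."""
    p = z.shape[1]
    return np.linalg.solve(z.T @ z + k * np.eye(p), z.T @ y)

# Hypothetical, nearly collinear design: the second column almost copies the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
z = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=50), rng.normal(size=50)])
y = z @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=50)

ztz = z.T @ z
k = 0.5
cond_ls = condition_number(ztz)                      # condition number of Z'Z
cond_ridge = condition_number(ztz + k * np.eye(3))   # condition number of Z'Z + k I
beta_ls = ridge_estimate(z, y, 0.0)
beta_ridge = ridge_estimate(z, y, k)
```

Every eigenvalue of `ztz` is shifted up by exactly `k`, so `cond_ridge` is smaller than `cond_ls`, and the ridge coefficient vector is never longer than the least squares one.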

Hoerl and Kennard list many properties of the ridge estimator, and
prove the "Ridge Existence Theorem". This theorem asserts that for a fixed
parameter point $\beta_0$, there exists a value of $k$ (or values of $k_i$, $i = 1,2,\ldots,p$)
depending on $\beta_0$, for which the risk of $\hat\beta(k)$ is smaller than the risk of
$\hat\beta$. This theorem, together with results arising from the work of Stein,
has led to the search for minimax ridge estimators.

In  Section  2,  we  discuss  the  canonical  form  of  the  problem,  and 
develop  the  necessary  notation.  Section  3 contains  the  asymptotic  (as  the 
parameter  value  increases)  results  needed  as  a preliminary  step  in 
developing  the  main  theorem.  Section  4 contains  the  main  theorem,  the 
sufficient conditions for minimaxity of the estimators, while in Section
5 we  show  that  for  a smaller  class  of  estimators  these  conditions  are 
necessary  and  sufficient.  Section  6 contains  a discussion  about  the 
relationship  between  minimaxity  and  the  conditioning  problem. 
2.  The Canonical Problem

The technique of simultaneous diagonalization has found frequent
use in proving minimaxity of classes of estimators (see, for example,
Berger (1976b) or Strawderman (1976)). The problem is rotated into
a space where both the covariance matrix and the loss matrix are
diagonal, which greatly simplifies calculations while preserving minimaxity.
However, with estimators of the form (1.12) it is necessary to
simultaneously diagonalize three matrices ($Z'Z$, $P'KP$, $Q$) which, in
general, is not possible. A sufficient condition for the simultaneous
diagonalization of these three matrices is that $Q$ and $Z'Z$ have common
eigenvectors. In the absence of any prior knowledge, an experimenter
will usually choose $Q = I$ or $Q = (Z'Z)^{-1}$ and the simultaneous diagonalization
can be carried out. However, it is often the case that an experimenter
has some knowledge of the losses he is willing to incur in the individual
components, possibly from cost considerations or prior knowledge.
For this purpose, it is worthwhile for the estimator to perform well
against an arbitrary choice of $Q$.

Since Hoerl and Kennard's estimator was proposed only with the
choice $Q = I$ in mind, we cannot expect it to perform well when $Q$ is
arbitrary. A slight generalization, however, will handle any choice of
$Q$. As an extension of (1.12) we define

    $\hat\beta_Q(K) = (Z'Z + M'KM)^{-1} Z'Y$,  (2.1)

where $M$ is a non-singular matrix which simultaneously diagonalizes
$Z'Z$ and $Q$. If $Q$ and $Z'Z$ have common eigenvectors, (2.1) is the original
ridge estimator. If $D$ is the diagonal matrix of eigenvalues of
$Q(Z'Z)^{-1}$, $M$ satisfies

    $M'D^{-1}M = Z'Z$,  $M'M = Q$,  (2.2)

and showing that $\hat\beta_Q(K)$ is minimax against the loss

    $L(\hat\beta, \beta, \sigma^2) = \frac{1}{\sigma^2} (\hat\beta - \beta)' Q (\hat\beta - \beta)$  (2.3)

can be reduced as follows. $\hat\beta_Q(K)$ can be written

    $\hat\beta_Q(K) = (M'(D^{-1} + K)M)^{-1} M'D^{-1}M\hat\beta = M^{-1}(D^{-1} + K)^{-1} D^{-1} M\hat\beta$.

Let $X = M\hat\beta$, $\theta = M\beta$. Since $\hat\beta \sim N(\beta, \sigma^2 (Z'Z)^{-1})$, it follows that
$X \sim N(\theta, \sigma^2 D)$. Also, from (2.2),

    $L(\hat\beta, \beta, \sigma^2) = \frac{1}{\sigma^2} (M\hat\beta - M\beta)'(M\hat\beta - M\beta) = \frac{1}{\sigma^2} (M\hat\beta - \theta)'(M\hat\beta - \theta)$.

If we let $\delta_Q(K) = M\hat\beta_Q(K)$, we have

    $\delta_Q(K) = (D^{-1} + K)^{-1} D^{-1} X$,  (2.4)

where the $i$th component can be written

    $\delta_{Qi}(K) = \left(1 - \frac{k_i d_i}{k_i d_i + 1}\right) X_i$,  (2.5)

and the loss of (2.3) becomes

    $L(\delta_Q(K), \theta, \sigma^2) = \frac{1}{\sigma^2} (\delta_Q(K) - \theta)'(\delta_Q(K) - \theta)$.  (2.6)

It then follows that $\hat\beta_Q(K)$ is minimax against loss (2.3) if and only if
$\delta_Q(K)$ is minimax against the loss (2.6).
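
The reduction to canonical form can be illustrated numerically. The sketch below is one possible construction of a simultaneously diagonalizing matrix, assuming NumPy and symmetric positive definite inputs; the function names `simultaneous_diagonalizer` and `canonical_shrinkage`, the simulated data, and the Cholesky-plus-eigendecomposition route are assumptions made for illustration, not the paper's own procedure.

```python
import numpy as np

def simultaneous_diagonalizer(ztz, q):
    """Return (M, d) with M'M = Q and M' diag(1/d) M = Z'Z.

    Both inputs are assumed symmetric positive definite.
    """
    c = np.linalg.cholesky(q).T      # Q = C'C with C upper triangular
    c_inv = np.linalg.inv(c)
    s = c_inv.T @ ztz @ c_inv        # symmetric positive definite
    s = (s + s.T) / 2                # remove rounding asymmetry
    lam, u = np.linalg.eigh(s)       # S = U diag(lam) U'
    m = u.T @ c                      # M'M = C'UU'C = Q
    return m, 1.0 / lam              # d_i = 1 / lam_i

def canonical_shrinkage(d, k):
    """Diagonal of (D^{-1} + K)^{-1} D^{-1}, i.e. 1 - k_i d_i / (k_i d_i + 1)."""
    return 1.0 - k * d / (k * d + 1.0)

# Hypothetical design, loss matrix, and ridge constants.
rng = np.random.default_rng(1)
z = rng.normal(size=(30, 4))
y = rng.normal(size=30)
ztz = z.T @ z
b = rng.normal(size=(4, 4))
q = b @ b.T + 4.0 * np.eye(4)        # arbitrary positive definite Q
k = np.array([0.1, 0.2, 0.3, 0.4])

m, d = simultaneous_diagonalizer(ztz, q)
beta_ls = np.linalg.solve(ztz, z.T @ y)
beta_q = np.linalg.solve(ztz + m.T @ np.diag(k) @ m, z.T @ y)   # form (2.1)
x = m @ beta_ls                                                  # X = M beta-hat
delta_q = canonical_shrinkage(d, k) * x                          # componentwise canonical form
```

Here `m @ beta_q` agrees with `delta_q`, and the entries of `d` are the eigenvalues of $Q(Z'Z)^{-1}$.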

In the following we will suppress the dependence of the estimator
on $Q$, and since $K$ will be a function of $X$ and $s$, the variance estimate,
we will denote the ridge estimators by $\delta^R(X,s)$.

Finally, we note that since $X$ is minimax with constant risk

    $R(X, \theta, \sigma^2) = E\, L(X, \theta, \sigma^2) = \mathrm{tr}\, D$,

where "tr" denotes the trace operator, an estimator $\delta(X,s)$ is minimax
if and only if

    $\Delta(\delta, \theta, \sigma^2) = R(\delta, \theta, \sigma^2) - R(X, \theta, \sigma^2) \le 0, \quad \forall \theta$.
3.  Tail Minimax Conditions

The form of Hoerl and Kennard's ridge estimator, while intuitively
pleasing, leads to a rather complicated risk function. If one tries to
apply Stein's integration by parts technique (Efron and Morris (1976))
in which an unbiased estimate of the risk is obtained and bounded above
for all $X$, it seems that one is led to either bounds that are not sharp
(Thisted (1976)) or additional conditions on the estimator (Strawderman
(1976)). The proof in this paper avoids these complications by obtaining
an upper bound on the risk of $\delta^R(X,s)$ by an indirect method.

We begin with the concept of tail minimaxity introduced by Berger
(1976a) to deal with losses other than quadratic. We use tail
minimaxity here to obtain a simplified expression for the risk of $\delta^R(X,s)$.

Definition 3.1: An estimator $\delta(X,s)$ is tail minimax if $\exists\, M > 0$ such that
$\forall \theta$ satisfying $\theta'\theta > M$, $\Delta(\delta(X,s), \theta, \sigma^2) \le 0$.

Since $\delta^R(X,s)$ shrinks $X$ toward zero (as can be seen from (2.5)),
it should perform well against quadratic loss for small values of $\theta$. Thus,
we begin our investigation for minimax ridge estimators by examining
conditions under which the risk of the ridge estimators dominates that of
$X$ for large values of $\theta$, i.e., those that are tail minimax. We first
develop conditions under which, for large values of $\theta$, the quantity
$Ef(X)$ can be approximated by $f(\theta)$ with error small enough to be ignored.
We then use this approximation on the risk function of $\delta^R(X,s)$ to
derive conditions for tail minimaxity.

From the work of Brown (1971) and Berger (1976a), it is reasonable
to choose $k_i$ so that the quantity

    $\gamma(X,s) = X - \delta(X,s)$,  (3.1)

is, for large values of $X'X$, approximately $c/X'X$ for some constant
$c$, i.e.,

    $\gamma(X,s) \approx c/X'X$.  (3.2)

To this end, we consider $k_i$ of the form

    $k_i = \frac{a_i s\, r(X'D^{-1}X/s)}{X'D^{-1}X}$,  (3.3)

where $a_i$ is a positive constant and $r(\cdot)$ is a bounded function satisfying
certain regularity conditions. While the quadratic form in the
denominator may contain any positive definite matrix and still
satisfy (3.2), it will be important later in this paper for the quadratic
form to follow a non-central chi-square distribution.

For $k_i$ as in (3.3), the ridge estimator of (2.5) can be written
componentwise as

    $\delta_i^R(X,s) = \left(1 - \frac{a_i d_i r(X'D^{-1}X/s)}{a_i d_i r(X'D^{-1}X/s) + X'D^{-1}X/s}\right) X_i, \quad 1 \le i \le p$.  (3.4)
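
The componentwise form (3.4) translates directly into code. This is a minimal sketch, assuming NumPy; the function name `ridge_estimator`, the particular choice `r_example`, and all numerical inputs are hypothetical. The example $r$ is bounded and non-decreasing with $r(t)/t$ non-increasing, but no minimax bound on $r$ is checked here.

```python
import numpy as np

def ridge_estimator(x, s, d, a, r):
    """Componentwise ridge estimator of the form (3.4).

    x : observed vector X (length p, assumed nonzero)
    s : variance estimate (positive scalar)
    d : diagonal of D (positive, length p)
    a : constants a_i (positive, length p)
    r : callable, the bounded function r(t)
    """
    x, d, a = (np.asarray(v, dtype=float) for v in (x, d, a))
    t = np.sum(x**2 / d) / s          # t = X' D^{-1} X / s
    adr = a * d * r(t)                # a_i d_i r(t)
    return (1.0 - adr / (adr + t)) * x

def r_example(t):
    """Hypothetical choice r(t) = 0.5 t / (t + 1)."""
    return 0.5 * t / (t + 1.0)

x = np.array([3.0, -1.0, 2.0, 0.5])
s = 1.2
d = np.array([1.0, 2.0, 0.5, 1.0])
a = np.full(4, 0.3)
delta = ridge_estimator(x, s, d, a, r_example)

# The same estimator through k_i of the form (3.3) and the shrinkage factor 1 - k_i d_i / (k_i d_i + 1).
t = np.sum(x**2 / d) / s
k = a * s * r_example(t) / np.sum(x**2 / d)
delta_via_k = (1.0 - k * d / (k * d + 1.0)) * x
```

Both routes give the same vector, and each component of `delta` keeps the sign of the corresponding component of `x` while being smaller in absolute value.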

We start with the following lemma, which gives conditions on a function
$f(x)$ under which, for large values of $\theta$, $Ef(X)$ can be approximated
by $f(\theta)$ with small error.

Lemma 3.1: Let $X \sim N(\theta, I)$, and let the function $f: R^p \to R$ satisfy

    i)   $f$ has all second order partial derivatives,
    ii)  $E(f(X) - f(\theta))^2 \le K|\theta|^q$ for some constants $q$ and $K$,
    iii) $\sup_{|y| \ge |\theta|/2} |f^{ij}(y) - f^{ij}(\theta)| = o(|\theta|^{-2})$, $1 \le i, j \le p$,

where $f^{ij}(x) = \frac{\partial^2}{\partial x_i \partial x_j} f(x)$. Then

    $|Ef(X) - f(\theta)| = o(|\theta|^{-2})$.

Proof: Define the regions $W$ and $W^c$ by

    $W = \{X : |X - \theta| \le |\theta|/2\}$,
    $W^c = \{X : |X - \theta| > |\theta|/2\}$.

The Taylor expansion of $f$ about $\theta$ (up to second order terms) is

    $f(X) = f(\theta) + \sum_{i=1}^p (X_i - \theta_i) f^i(\theta) + \rho(X,\theta)$,  (3.5)

where

    $f^i(\theta) = \frac{\partial}{\partial x_i} f(x) \Big|_{x=\theta}$,

    $\rho(X,\theta) = \frac{(1-t)^2}{2!} \sum_{i,j} (X_i - \theta_i)(X_j - \theta_j)\left(f^{ij}(\theta + t(X - \theta)) - f^{ij}(\theta)\right)$  (3.6)

for some $t$, $0 < t < 1$. Letting $\Phi(\cdot)$ denote the cumulative normal
distribution with mean 0 and covariance matrix $I$ we have

    $Ef(X) = \int_W \left\{ f(\theta) + \sum_{i=1}^p (X_i - \theta_i) f^i(\theta) + \rho(X,\theta) \right\} d\Phi(X - \theta) + \int_{W^c} f(X)\, d\Phi(X - \theta)$.

From the definition of $W$, a simple sign invariance argument will show

    $\int_W (X_i - \theta_i)\, d\Phi(X - \theta) = 0, \quad i = 1, \ldots, p$,

therefore,

    $Ef(X) = f(\theta) + \int_W \rho(X,\theta)\, d\Phi(X - \theta) + \int_{W^c} (f(X) - f(\theta))\, d\Phi(X - \theta)$,  (3.7)

and hence,

    $|Ef(X) - f(\theta)| \le \int_W |\rho(X,\theta)|\, d\Phi(X - \theta) + \int_{W^c} |f(X) - f(\theta)|\, d\Phi(X - \theta)$.  (3.8)

Noting that $X \in W \Rightarrow |\theta + t(X - \theta)| \ge |\theta|/2$ for $0 < t < 1$, we have

    $\sup_{X \in W} |f^{ij}(\theta + t(X - \theta)) - f^{ij}(\theta)| \le \sup_{|y| \ge |\theta|/2} |f^{ij}(y) - f^{ij}(\theta)|$.

It then follows from (3.6) and condition (iii) that

    $\int_W |\rho(X,\theta)|\, d\Phi(X - \theta) \le \int_W \left\{ \frac{1}{6} \sum_{i,j} |X_i - \theta_i||X_j - \theta_j| \times \sup_{|y| \ge |\theta|/2} |f^{ij}(y) - f^{ij}(\theta)| \right\} d\Phi(X - \theta)$
    $\qquad \le \frac{1}{6} \max_{i,j} \sup_{|y| \ge |\theta|/2} |f^{ij}(y) - f^{ij}(\theta)| \sum_{i,j} E|X_i - \theta_i||X_j - \theta_j|$
    $\qquad = N \max_{i,j} \sup_{|y| \ge |\theta|/2} |f^{ij}(y) - f^{ij}(\theta)|$
    $\qquad = o(|\theta|^{-2})$,  (3.9)

where $N = (1/6) \sum_{i,j} E|X_i - \theta_i||X_j - \theta_j| < \infty$. Also, from the definition
of $W^c$,

    $\int_{W^c} |f(X) - f(\theta)|\, d\Phi(X - \theta) = E|f(X) - f(\theta)|\, I_{(|\theta|/2, \infty)}(|X - \theta|)$
    $\qquad \le \{E(f(X) - f(\theta))^2\, E I_{(|\theta|/2, \infty)}(|X - \theta|)\}^{1/2}$

by Holder's inequality. Using the well known fact (see, e.g., Chung
(1968)) that if $a > 0$ then

    $\int_a^\infty \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz \le \frac{1}{a\sqrt{2\pi}} e^{-a^2/2}$,

we have

    $E I_{(|\theta|/2, \infty)}(|X - \theta|) = P(|X - \theta| \ge |\theta|/2)$
    $\qquad \le \sum_{i=1}^p P(|X_i - \theta_i| \ge |\theta|/2p^{1/2})$
    $\qquad \le 2p\, P(N(0,1) \ge |\theta|/2p^{1/2})$
    $\qquad \le \frac{4p^{3/2}}{|\theta|\sqrt{2\pi}} \exp\{-|\theta|^2/8p\}$,

and combining this with condition (ii) and the fact that $\lim_{y \to \infty} y^n e^{-y} = 0$ $\forall n$
we have

    $\int_{W^c} |f(X) - f(\theta)|\, d\Phi(X - \theta) = o(|\theta|^{-n})$

$\forall n$, and hence the result follows.

The extension of Lemma 3.1 to the case $X \sim N(\theta, \Sigma)$, $\Sigma$ a known positive
definite matrix, proceeds in the usual manner (i.e., diagonalizing $\Sigma$),
and is stated without proof.

Lemma 3.2: Let $X \sim N(\theta, \Sigma)$, and let $f: R^p \to R$ satisfy conditions
i) - iii) of Lemma 3.1. Then

    $|Ef(X) - f(\theta)| = o(|\theta|^{-2})$.
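
The approximation in Lemmas 3.1 and 3.2 can be illustrated by simulation. The sketch below, assuming NumPy, uses a hypothetical test function $f(x) = 1/x'x$ in dimension $p = 5$ and estimates $|\theta|^2 (Ef(X) - f(\theta))$ by Monte Carlo for increasing $|\theta|$; the function names, sample size, and seed are illustrative assumptions, and a simulation is of course no substitute for the proof.

```python
import numpy as np

def f(x):
    """Hypothetical test function f(x) = 1 / x'x, smooth away from the origin."""
    return 1.0 / np.sum(x**2, axis=-1)

def scaled_error(theta, n_draws=200_000, seed=0):
    """Monte Carlo estimate of |theta|^2 * (E f(X) - f(theta)) for X ~ N(theta, I)."""
    rng = np.random.default_rng(seed)
    x = theta + rng.normal(size=(n_draws, theta.size))
    return np.sum(theta**2) * (np.mean(f(x)) - f(theta))

p = 5
direction = np.ones(p) / np.sqrt(p)   # unit vector, so |theta| equals the scale factor
errors = {norm: scaled_error(norm * direction) for norm in (5.0, 10.0, 20.0)}
```

If the $o(|\theta|^{-2})$ rate holds for this $f$, the scaled errors should shrink toward zero as $|\theta|$ grows, which is what the values in `errors` are meant to show.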

We now derive the asymptotic expression for the risk of the
estimator $\delta^R(X,s)$, given by (3.4), and the conditions under which it
is tail minimax.

Theorem 3.1: Let $X \sim N(\theta, \sigma^2 D)$, $D = \mathrm{diag}(d_1, \ldots, d_p)$, and let $s \sim \sigma^2 \chi^2_m$
be independent of $X$. Let the loss of an estimator $\delta(X,s)$ of $\theta$ be
given by (2.6), and let $\delta^R(X,s)$ be the ridge estimator given by (3.4)
where $r(t): R \to [0, \infty)$ satisfies

    i)   $t^{1/2} r'(t) = o(1)$,
    ii)  $t^{3/2} r''(t) = o(1)$,
    iii) $r(t)$ is bounded and non-decreasing,
    iv)  $r(t)/t$ is non-increasing.

If $\exists\, \epsilon_1 > 0$ and $\epsilon_2 > 0$ such that

    $\epsilon_1 \le r(t) \le [2(m+2)^{-1}(\mathrm{tr}\, AD^2 - 2\lambda_{\max} AD^2)/\lambda_{\max} A^2 D^3] - \epsilon_2$,  (3.10)

where $A = \mathrm{diag}(a_1, \ldots, a_p)$, $a_i > 0$, $1 \le i \le p$, then $\exists\, K > 0$ such that
$\forall\, \theta'\theta > K$,

    $R(\delta^R(X,s), \theta, \sigma^2) \le R(X, \theta, \sigma^2)$.


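Proof: Define

    $\Delta(\delta^R, \theta, \sigma^2) = R(\delta^R(X,s), \theta, \sigma^2) - R(X, \theta, \sigma^2)$.

From (2.6) and (3.4) straightforward calculation yields

    $\Delta(\delta^R, \theta, \sigma^2) = (1/\sigma^2) \sum_{i=1}^p E\left\{ \frac{(a_i d_i r(t) X_i)^2}{(a_i d_i r(t) + t)^2} - \frac{2 X_i (X_i - \theta_i) a_i d_i r(t)}{a_i d_i r(t) + t} \right\}$,  (3.11)

where $t = X'D^{-1}X/s$. Integrating the last term in (3.11) by parts and
defining

    $w_m = s/\sigma^2, \quad Z_i = X_i/\sigma, \quad v = Z'D^{-1}Z$,

yields

    $\Delta(\delta^R, \theta, \sigma^2) = \sum_{i=1}^p E\left\{ \frac{(a_i d_i r(v/w_m))^2 w_m^2 Z_i^2}{(a_i d_i r(v/w_m) w_m + v)^2} - \frac{2 a_i d_i^2 r(v/w_m) w_m}{a_i d_i r(v/w_m) w_m + v} \right.$
    $\qquad \left. + \frac{4 a_i d_i r(v/w_m) w_m Z_i^2}{(a_i d_i r(v/w_m) w_m + v)^2} - \frac{4 a_i d_i Z_i^2 v\, r'(v/w_m)}{(a_i d_i r(v/w_m) w_m + v)^2} \right\}$.  (3.12)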

Since $r$ is non-decreasing, the last term is bounded above by zero. Note
that $t = X'D^{-1}X/s = Z'D^{-1}Z/w_m$, and applying Lemma 4, Appendix to the
function $q(t) = t^{-1}h(t)$ we have

    $E\{\chi^2_m h(Z'D^{-1}Z, \chi^2_m)\} = m E\{h(Z'D^{-1}Z, \chi^2_{m+2})\}$.  (3.13)

Using (3.13) on each of the first three terms of (3.12), bounding the
last by zero, and rearranging terms gives

    $\Delta(\delta^R, \theta, \sigma^2) \le m \sum_{i=1}^p E\left\{ \frac{[(a_i d_i r(v/w_{m+2}))^2 w_{m+2} + 4 a_i d_i r(v/w_{m+2})] Z_i^2}{(a_i d_i r(v/w_{m+2}) w_{m+2} + v)^2} - \frac{2 a_i d_i^2 r(v/w_{m+2})}{a_i d_i r(v/w_{m+2}) w_{m+2} + v} \right\}$.  (3.14)

It follows from conditions (iii) and (iv) that $r(v/w)$ is non-increasing in
$w$, and $w r(v/w)$ is non-decreasing in $w$, and hence the function

    $q_i(w) = \frac{(a_i d_i r(v/w) Z_i)^2}{(a_i d_i w r(v/w) + v)^2}$

is non-increasing in $w$. Applying Lemma 5, Appendix shows

    $E\{q_i(\chi^2_{m+2})(\chi^2_{m+2} - (m+2))\} \le 0$,

so that (3.14) is bounded above by

    $\Delta(\delta^R(X,s), \theta, \sigma^2) \le m \sum_{i=1}^p E\left\{ \frac{a_i d_i r(v/w)(a_i d_i r(v/w)(m+2) + 4) Z_i^2}{(a_i d_i r(v/w) w + v)^2} - \frac{2 a_i d_i^2 r(v/w)}{a_i d_i r(v/w) w + v} \right\}$,  (3.15)

where, from here on, $w = w_{m+2} \sim \chi^2_{m+2}$. Divide the region of integration
of $w$ into the two intervals

    $W_0 = \{w : w \le M\}$,
    $W_1 = \{w : w > M\}$,

where $M$ is a positive constant. The exact method of choosing $M$ will
be detailed later in the proof. Let $g_i(w,Z)$ denote the quantity in
braces in expression (3.15) and let $F(\cdot)$ denote the cumulative $\chi^2$
distribution with $m+2$ degrees of freedom. Then

    $\Delta(\delta^R(X,s), \theta, \sigma^2) \le m \int_{W_0} \sum_{i=1}^p E_Z(g_i(w,Z))\, dF(w) + m \int_{W_1} \sum_{i=1}^p E_Z(g_i(w,Z))\, dF(w)$.

Consider first the integral over $W_1$. Since $a_i d_i r(v/w) w \ge 0$ and
$Z'D^{-1}Z \ge Z_i^2/d_i$,

    $\int_{W_1} \sum_{i=1}^p E_Z(g_i(w,Z))\, dF(w) \le \int_{W_1} \sum_{i=1}^p E_Z\left\{ \left(\frac{a_i d_i r(v/w)}{v}\right)(a_i d_i^2 r(v/w)(m+2) + 2 d_i) \right\} dF(w)$
    $\qquad \le \int_{W_1} \sum_{i=1}^p E_Z\left\{ \left(\frac{a_i d_i r^*}{v}\right)(a_i d_i^2 r^*(m+2) + 2 d_i) \right\} dF(w)$
    $\qquad = [E_Z\{v^{-1}\, \mathrm{tr}((m+2) r^{*2} A^2 D^3 + 2 r^* AD^2)\}]\, P(w > M)$,  (3.16)

where $r^* = \sup_t r(t)$. Since

    $E_Z(v^{-1}) = E_Z(Z'D^{-1}Z)^{-1} = \sigma^2/\theta'D^{-1}\theta + o(\sigma^2/\theta'\theta)$,

the last expression in (3.16) is equal to

    $(\sigma^2/\theta'D^{-1}\theta)(\mathrm{tr}[(m+2) r^{*2} A^2 D^3 + 2 r^* AD^2])\, P(w > M) + o(\sigma^2/\theta'\theta)$.  (3.17)


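Consider next the integral over $W_0$. It is straightforward to verify that,
for fixed $w$, $g_i(w,Z)$ satisfies the conditions of Lemma 3.1. Thus

    $\int_{W_0} \sum_{i=1}^p E_Z g_i(w,Z)\, dF(w) = \int_{W_0} \sum_{i=1}^p g_i(w, \theta/\sigma)\, dF(w) + \int_{W_0} o(\sigma^2/\theta'\theta)\, dF(w)$
    $\qquad = \int_{W_0} \sum_{i=1}^p \left\{ \frac{a_i d_i r(\nu/w)(a_i d_i r(\nu/w)(m+2) + 4)\theta_i^2}{\sigma^2 (a_i d_i r(\nu/w) w + \nu)^2} - \frac{2 a_i d_i^2 r(\nu/w)}{a_i d_i r(\nu/w) w + \nu} \right\} dF(w)$
    $\qquad\quad + \int_{W_0} o(\sigma^2/\theta'\theta)\, dF(w)$,  (3.18)

where $\nu = \theta'D^{-1}\theta/\sigma^2$. Straightforward calculation will show that the
individual terms comprising the $o(\sigma^2/\theta'\theta)$ term in (3.18), which are
the higher order derivatives of $g_i(w,Z)$, can each be bounded by a
function which is independent of $w$ and of order $o(\sigma^2/\theta'\theta)$. Now
writing

    $(a_i d_i r(\nu/w) w + \nu)^{-1} = \nu^{-1}(1 - a_i d_i r(\nu/w) w/(a_i d_i r(\nu/w) w + \nu)) = \nu^{-1}(1 - \gamma_i(\nu,w))$,

and

    $s_i(\nu,w) = a_i d_i (a_i d_i r(\nu/w)(m+2) + 4)\theta_i^2/\sigma^2$,

we can write (3.18) as

    $\int_{W_0} \sum_{i=1}^p E_Z g_i(w,Z)\, dF(w) = \int_{W_0} \left\{ \frac{r(\nu/w)}{\nu} \sum_{i=1}^p (1 - \gamma_i(\nu,w))[(1 - \gamma_i(\nu,w)) s_i(\nu,w)/\nu - 2 a_i d_i^2] \right\} dF(w) + o(\sigma^2/\theta'\theta)$
    $\qquad = \int_{W_0} \left\{ \frac{r(\nu/w)}{\nu} \sum_{i=1}^p (s_i(\nu,w)/\nu - 2 a_i d_i^2) \right\} dF(w)$
    $\qquad\quad - \int_{W_0} \left\{ \frac{r(\nu/w)}{\nu} \sum_{i=1}^p [\gamma_i(\nu,w)(s_i(\nu,w)/\nu - 2 a_i d_i^2) + \gamma_i(\nu,w)(1 - \gamma_i(\nu,w)) s_i(\nu,w)/\nu] \right\} dF(w)$
    $\qquad\quad + o(\sigma^2/\theta'\theta)$.  (3.19)

Recall $r^* = \sup_t r(t)$. Then for $w \in W_0$,

    $\gamma_i(\nu,w) \le a_i d_i r^* M (a_i d_i r^* M + \nu)^{-1}, \quad 1 \le i \le p$,
    $s_i(\nu,w)/\nu \le a_i d_i^2 (a_i d_i r^*(m+2) + 4), \quad 1 \le i \le p$,

and thus it is clear that the second integral in (3.19) is $o(\nu^{-1}) = o(\sigma^2/\theta'\theta)$.
Hence, summing the first term in (3.19) yields

    $\int_{W_0} \sum_{i=1}^p E_Z g_i(w,Z)\, dF(w) \le \int_{W_0} \left\{ \frac{r(\nu/w)}{\nu} \left[ \frac{r(\nu/w)(m+2)\theta'A^2D^2\theta + 4\theta'AD\theta}{\theta'D^{-1}\theta} - 2\, \mathrm{tr}\, AD^2 \right] \right\} dF(w) + o(\sigma^2/\theta'\theta)$
    $\qquad \le \int_{W_0} \left\{ \frac{r(\nu/w)}{\nu} (\lambda_{\max} A^2D^3)(m+2) \left[ r(\nu/w) - \frac{2(m+2)^{-1}(\mathrm{tr}\, AD^2 - 2\lambda_{\max} AD^2)}{\lambda_{\max} A^2D^3} \right] \right\} dF(w) + o(\sigma^2/\theta'\theta)$,  (3.20)

since

    $\frac{\theta'A^2D^2\theta}{\theta'D^{-1}\theta} \le \lambda_{\max} A^2D^3, \qquad \frac{\theta'AD\theta}{\theta'D^{-1}\theta} \le \lambda_{\max} AD^2$.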

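By assumption, the quantity in square brackets in (3.20) is bounded
above by $-\epsilon_2$, $r(\nu/w) \ge \epsilon_1$, and since $\lambda_{\max} A^2D^3 > 0$,

    $\int_{W_0} \sum_{i=1}^p E_Z g_i(w,Z)\, dF(w) \le \frac{\sigma^2}{\theta'D^{-1}\theta}[-\epsilon_1 \epsilon_2 \lambda_{\max} A^2D^3 (m+2) P(w \le M)] + o(\sigma^2/\theta'\theta)$.  (3.21)

Combining (3.17) and (3.21) yields

    $\Delta(\delta^R, \theta, \sigma^2) \le \frac{m\sigma^2}{\theta'D^{-1}\theta} \{ -\epsilon_1 \epsilon_2 \lambda_{\max} A^2D^3 (m+2) P(w \le M)$
    $\qquad + (\mathrm{tr}[(m+2) r^{*2} A^2D^3 + 2 r^* AD^2]) P(w > M) \} + o(\sigma^2/\theta'\theta)$.  (3.22)

Now $M$ is chosen large enough so that

    $-\epsilon_1 \epsilon_2 \lambda_{\max} A^2D^3 (m+2) P(w \le M) + \mathrm{tr}[(m+2) r^{*2} A^2D^3 + 2 r^* AD^2] P(w > M) \le -\epsilon_3 < 0$,

for some $\epsilon_3 > 0$, and thus from (3.22),

    $\Delta(\delta^R, \theta, \sigma^2) \le -\epsilon_3 m \sigma^2/\theta'D^{-1}\theta + o(\sigma^2/\theta'\theta)$
    $\qquad \le -\epsilon_3 m \lambda_{\min}(D)\, \sigma^2/\theta'\theta + o(\sigma^2/\theta'\theta)$,

and for sufficiently large $\theta'\theta$, $\Delta(\delta^R, \theta, \sigma^2) \le 0$ and so $\delta^R(X,s)$ is tail
minimax. ||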


While Theorem 3.1 does not guarantee that the risk of $\delta^R(X,s)$ will
lie below that of $X$ for any specified values of $\theta$, it does provide a
bound on the tail behavior of the risk function of $\delta^R(X,s)$. In the
next section we show that this bound is, in fact, a global bound.


4.  Sufficient Conditions for Minimaxity

The main theorem of this section, Theorem 4.1, extends the tail
minimax bound of Theorem 3.1 to a global bound. We introduce a new
method of proof, which differs sharply from the techniques previously
used to prove minimaxity. Rather than bounding the risk function
pointwise by a function which lies below $R(X, \theta, \sigma^2)$, we identify the
extrema of $R(\delta^R, \theta, \sigma^2)$, and show that at these points the risk
function of $\delta^R(X,s)$ is below that of $X$.

Theorem 4.1: Let $\delta^R(X,s)$ be the ridge estimator of (3.4) where
$r(t): R \to [0, \infty)$ satisfies conditions i) - iv) of Theorem 3.1. If

    $0 < r(t) \le 2(m+2)^{-1}[\mathrm{tr}\, AD^2 - 2\lambda_{\max} AD^2]/\lambda_{\max} A^2D^3$,  (4.1)

$\forall\, t > 0$, then $\delta^R(X,s)$ is minimax against the loss (2.6).

Proof: Assume that the bound in (4.1) is strict, i.e., $\exists\, \epsilon_1$ and $\epsilon_2$,
both positive, such that $\forall\, t > 0$

    $\epsilon_1 \le r(t) \le (2(m+2)^{-1}[\mathrm{tr}\, AD^2 - 2\lambda_{\max} AD^2]/\lambda_{\max} A^2D^3) - \epsilon_2$.  (4.2)

Then from Theorem 3.1 $\exists\, M > 0$ such that $\forall\, \theta'\theta > M$, $\Delta(\delta^R, \theta, \sigma^2) \le 0$.
Consider the set $S = \{\theta : \theta'\theta \le M\}$, a compact sphere in $R^p$. We will
bound $\Delta(\delta^R, \theta, \sigma^2)$ by a continuous function $\gamma(\delta^R, \theta, \sigma^2)$, which must have
a maximum on $S$. We will then show that, with the exception of the
point $\theta = 0$, $\gamma(\delta^R, \theta, \sigma^2)$ does not have an extreme point in the
interior of $S$ and thus achieves its maximum either at $\theta = 0$ or $\theta'\theta = M$.
If $\theta'\theta = M$, it will follow from Theorem 3.1 that $\gamma(\delta^R, \theta, \sigma^2) \le 0$.
We then show that $\gamma(\delta^R, 0, \sigma^2) \le 0$. A simple argument, using Fatou's
Lemma, then allows the result to be extended to the case when the
inequality in condition (4.1) is not strict.

Using the notation of Theorem 3.1, from (3.15) we have

    $\Delta(\delta^R, \theta, \sigma^2) \le \gamma(\delta^R, \theta, \sigma^2)$
    $\qquad = m \sum_{i=1}^p E\left\{ \frac{a_i d_i r(v/w)(a_i d_i r(v/w)(m+2) + 4) Z_i^2}{(a_i d_i r(v/w) w + v)^2} - \frac{2 a_i d_i^2 r(v/w)}{a_i d_i r(v/w) w + v} \right\}$,  (4.3)

where $Z_i \sim N(\theta_i/\sigma, d_i)$, $v = Z'D^{-1}Z$, $w \sim \chi^2_{m+2}$ independent of $Z$. Define

    $\eta = \theta/\sigma$,
    $g_i(w,Z) = a_i d_i r(v/w)(a_i d_i r(v/w)(m+2) + 4)$,  (4.4)
    $h_i(w,Z) = (a_i d_i r(v/w) w + v)^{-1}$,

then

    $\gamma(\delta^R, \theta, \sigma^2) = m \sum_{i=1}^p E\{ g_i(w,Z) h_i^2(w,Z) Z_i^2 - 2 a_i d_i^2 r(v/w) h_i(w,Z) \}$.  (4.5)

Letting $\chi^2_p(\nu)$ denote a chi-square random variable with $p$ degrees of
freedom and non-centrality parameter $\nu/2$, we have from Lemma 2,
Appendix,

    $\gamma(\delta^R, \eta, \sigma^2) = m \sum_{i=1}^p E\{ d_i g_i(w, \chi^2_{p+2}(\nu)) h_i^2(w, \chi^2_{p+2}(\nu))$
    $\qquad + \eta_i^2 g_i(w, \chi^2_{p+4}(\nu)) h_i^2(w, \chi^2_{p+4}(\nu))$  (4.6)
    $\qquad - 2 a_i d_i^2 r(\chi^2_p(\nu)/w) h_i(w, \chi^2_p(\nu)) \}$,

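where $\nu = \eta'D^{-1}\eta$. We note the following: if $f(x)$ is a function of $x$
only through $x^2$, then with the possible exception of $x = 0$,

    $\frac{d}{dx} f(x) = 0$ if and only if $\frac{d}{dx^2} f(x) = 0$.

From (4.6) it can be seen that $\gamma(\delta^R, \eta, \sigma^2)$ is a function of $\eta$ only
through $\eta_i^2$. Thus, with the possible exception of $\eta = 0$, a point
$\eta_0$ is an extreme point of $\gamma(\delta^R, \eta, \sigma^2)$ only if

    $\frac{\partial}{\partial \eta_i^2} \gamma(\delta^R, \eta, \sigma^2) \Big|_{\eta = \eta_0} = 0, \quad 1 \le i \le p$.  (4.7)

We now show that such a point does not exist. From Lemma 6, Appendix,

    $\frac{\partial}{\partial \eta_k^2} \gamma(\delta^R, \eta, \sigma^2) = E\Big\{ \sum_{i=1}^p \Big( [g_i(w, \chi^2_{p+4}(\nu)) h_i^2(w, \chi^2_{p+4}(\nu)) - g_i(w, \chi^2_{p+2}(\nu)) h_i^2(w, \chi^2_{p+2}(\nu))]$
    $\qquad + (\eta_i^2/2d_i)[g_i(w, \chi^2_{p+6}(\nu)) h_i^2(w, \chi^2_{p+6}(\nu)) - g_i(w, \chi^2_{p+4}(\nu)) h_i^2(w, \chi^2_{p+4}(\nu))]$
    $\qquad + (a_i d_i^2)[r(\chi^2_{p+2}(\nu)/w) h_i(w, \chi^2_{p+2}(\nu)) - r(\chi^2_p(\nu)/w) h_i(w, \chi^2_p(\nu))] \Big) \Big\}$
    $\qquad + E\{ g_k(w, \chi^2_{p+4}(\nu)) h_k^2(w, \chi^2_{p+4}(\nu)) \}$.  (4.8)

Notice that the sum in (4.8) does not depend on $k$, the index of
differentiation. Therefore, denoting the sum by $\mathcal{A}(w,\eta)$,

    $\frac{\partial}{\partial \eta_k^2} \gamma(\delta^R, \eta, \sigma^2) = E\mathcal{A}(w,\eta) + E g_k(w, \chi^2_{p+4}(\nu)) h_k^2(w, \chi^2_{p+4}(\nu))$  (4.9)

for all $k$, $1 \le k \le p$. Thus, in order for (4.7) to be satisfied at some
point $\eta_0 \ne 0$, it must be the case that

    $E g_i(w, \chi^2_{p+4}(\nu)) h_i^2(w, \chi^2_{p+4}(\nu)) = E g_j(w, \chi^2_{p+4}(\nu)) h_j^2(w, \chi^2_{p+4}(\nu))$

for all $i, j$, $1 \le i, j \le p$. From (4.4),

    $E g_i(w, \chi^2_{p+4}(\nu)) h_i^2(w, \chi^2_{p+4}(\nu)) = E\left\{ \frac{a_i d_i r(\chi^2_{p+4}(\nu)/w)(a_i d_i r(\chi^2_{p+4}(\nu)/w)(m+2) + 4)}{(a_i d_i r(\chi^2_{p+4}(\nu)/w) w + \chi^2_{p+4}(\nu))^2} \right\}$,

and from Lemma 9, Appendix, this is a strictly increasing function of
$a_i d_i$. Therefore, (4.7) can be satisfied only if $a_i d_i = c$, $1 \le i \le p$, but
if this is the case, from (4.3),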
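
    $\Delta(\delta^R, \theta, \sigma^2) \le m \sum_{i=1}^p E\left\{ \frac{c\, r(v/w)(c\, r(v/w)(m+2) + 4) Z_i^2}{(c\, r(v/w) w + v)^2} - \frac{2 c d_i r(v/w)}{c\, r(v/w) w + v} \right\}$
    $\qquad \le m E\left\{ \frac{c\, r(v/w)}{c\, r(v/w) w + v} \left[ (c\, r(v/w)(m+2) + 4) \frac{Z'Z}{Z'D^{-1}Z} - 2\, \mathrm{tr}\, D \right] \right\}$,  (4.10)

since $Z'D^{-1}Z = v \le c\, r(v/w) w + v$. Since $Z'Z/Z'D^{-1}Z \le \lambda_{\max} D$, rearranging
terms in (4.10) shows

    $\Delta(\delta^R, \theta, \sigma^2) \le m c (m+2) \lambda_{\max} D\; E\left\{ \frac{c\, r(v/w)}{c\, r(v/w) w + v} \left( r(v/w) - \frac{2(\mathrm{tr}\, D - 2\lambda_{\max} D)}{c(m+2)\lambda_{\max} D} \right) \right\}$.  (4.11)

Under the restrictions $a_i d_i = c$, (4.2) can be written

    $\epsilon_1 \le r(t) \le 2(\mathrm{tr}\, D - 2\lambda_{\max} D)/(c(m+2)\lambda_{\max} D) - \epsilon_2$,

and hence the right hand side of (4.11) is negative. If $\eta = 0$, or
equivalently $\theta = 0$, it is obvious that

    $\Delta(\delta^R, 0, \sigma^2) \le 0$,

since $\delta^R(X,s)$ is always closer to zero than $X$. Thus, if (4.1) is
replaced by (4.2), $\delta^R(X,s)$ is a minimax estimator of $\theta$. If we define

    $r_\epsilon(t) = (1 - \epsilon) r(t) + \epsilon c$,  (4.12)

where $0 < \epsilon < 1$ and $c > 0$ satisfies

    $0 < c < 2(m+2)^{-1}(\mathrm{tr}\, AD^2 - 2\lambda_{\max} AD^2)/\lambda_{\max} A^2D^3$,
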
then the ridge estimator $\delta^R_\epsilon(X,s)$ given componentwise by

    $\delta^R_{\epsilon,i}(X,s) = \left(1 - \frac{a_i d_i r_\epsilon(X'D^{-1}X/s)}{a_i d_i r_\epsilon(X'D^{-1}X/s) + X'D^{-1}X/s}\right) X_i$,

satisfies the theorem with (4.1) replaced by (4.2), and hence is
minimax $\forall \epsilon$, $0 < \epsilon < 1$. It is clear that $\lim_{\epsilon \to 0} \delta^R_\epsilon(X,s) = \delta^R(X,s)$, and thus
from Fatou's Lemma

    $R(X, \theta, \sigma^2) \ge \liminf_{\epsilon \to 0} R(\delta^R_\epsilon(X,s), \theta, \sigma^2)$
    $\qquad \ge E \liminf_{\epsilon \to 0} L(\delta^R_\epsilon(X,s), \theta, \sigma^2)$
    $\qquad = E\, L(\delta^R(X,s), \theta, \sigma^2)$
    $\qquad = R(\delta^R(X,s), \theta, \sigma^2)$,

and hence $\delta^R(X,s)$ is minimax. ||

Condition (4.1) is essentially the same condition derived by other
authors working with certain Stein-type estimators. For example,
Bock (1975) showed that the spherically symmetric Stein-type estimator

    $\delta^B(X,s) = \left(1 - \frac{a\, r(X'D^{-1}X/s)}{X'D^{-1}X/s}\right) X$

is minimax provided

    $0 \le a\, r(t) \le 2(m+2)^{-1}(\mathrm{tr}\, D - 2\lambda_{\max} D)/\lambda_{\max} D$,

which is exactly the condition of Theorem 4.1 if we choose $a_i d_i = c$
to make $\delta^R(X,s)$ spherically symmetric. If $D = I$, $A = aI$, then (4.1)
reduces to the familiar

    $0 < a\, r(t) \le 2(p-2)(m+2)^{-1}$.
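
The two special cases above can be verified by evaluating the right-hand side of (4.1) directly. The following is a minimal sketch, assuming NumPy and diagonal $A$ and $D$ stored as vectors; the function name `minimax_bound` and the numerical values of $p$, $m$, $a$, $c$, and $d_i$ are illustrative assumptions.

```python
import numpy as np

def minimax_bound(a, d, m):
    """Upper bound on r(t) in condition (4.1).

    a, d : positive diagonals of A and D (length p)
    m    : degrees of freedom of the variance estimate s
    Returns 2 (m+2)^{-1} (tr AD^2 - 2 lmax(AD^2)) / lmax(A^2 D^3).
    A non-positive value means no positive r satisfies (4.1).
    """
    a, d = np.asarray(a, dtype=float), np.asarray(d, dtype=float)
    ad2 = a * d**2        # diagonal of A D^2
    a2d3 = a**2 * d**3    # diagonal of A^2 D^3
    return 2.0 / (m + 2) * (ad2.sum() - 2.0 * ad2.max()) / a2d3.max()

p, m, a0 = 6, 10, 0.7

# D = I, A = a I: the bound on a * r(t) becomes 2 (p - 2) / (m + 2).
bound_identity = a0 * minimax_bound(np.full(p, a0), np.ones(p), m)

# a_i d_i = c: the bound on r(t) becomes 2 (tr D - 2 lmax D) / (c (m + 2) lmax D).
d = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
c = 0.4
bound_spherical = minimax_bound(c / d, d, m)
```

Because $A$ and $D$ are diagonal with positive entries, each $\lambda_{\max}$ is simply the largest diagonal entry of the corresponding product.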


Theorem  4.1  has  an  immediate  extension  to  a wider  class  of  functions. 
We  state  this  in  the  following  corollary. 

Corollary 4.1: Let $\delta^R(X,s)$ be given componentwise by

    $\delta_i^R(X,s) = \left(1 - \frac{a_i d_i r(X'D^{-1}X, s)}{a_i d_i r(X'D^{-1}X, s) + X'D^{-1}X/s}\right) X_i$,  (4.13)

where $r: R^2 \to [0, \infty)$ satisfies

    i)   $\frac{\partial}{\partial t_1} r(t_1, t_2) = o(t_1^{-1/2})$,
    ii)  $\frac{\partial^2}{\partial t_1^2} r(t_1, t_2) = o(t_1^{-3/2})$,
    iii) $r(t_1, t_2)$ is non-decreasing in $t_1$ and non-increasing in $t_2$,
    iv)  $r(t_1, t_2)/t_1$ is non-increasing in $t_1$,
    v)   $t_2 r(t_1, t_2)$ is non-decreasing in $t_2$.

If

    $0 \le r(t_1, t_2) \le 2(m+2)^{-1}(\mathrm{tr}\, AD^2 - 2\lambda_{\max} AD^2)/\lambda_{\max} A^2D^3$,  (4.14)

for all $t_1, t_2 \ge 0$, then $\delta^R(X,s)$ is minimax against the loss (2.6).
The class of functions of Corollary 4.1 includes the ridge estimator
$\delta^S(X,s)$, given componentwise by

\[
\delta_i^S(X,s) = \left(1 - \frac{a d_i^{-1}}{a d_i^{-1} + X'D^{-1}X/s + g + h/s}\right) X_i,
\tag{4.15}
\]

where $a$, $g$ and $h$ are positive constants. Strawderman (1976) showed
$\delta^S(X,s)$ is minimax if


ii) $g \ge 2(p-2)(m+2)^{-1}$,
iii) $a \le (\min_i d_i)\, 2(p-2)(m+2)^{-1}$.


If we define

\[
r(X'D^{-1}X, s) = \frac{X'D^{-1}X/s}{X'D^{-1}X/s + g + h/s},
\tag{4.16}
\]

we can write (4.15) in the form given by (4.13). It is easy to check
that the function $r$ in (4.16) satisfies the conditions of Corollary
4.1, and that the minimax bound (4.14) can be written

\[
a \le (\min_i d_i)\, 2(p-2)(m+2)^{-1},
\]

and that the restriction $g \ge 2(m+2)^{-1}(p-2)$ is not necessary.
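The "easy to check" step can be written out as follows. This is a sketch under one assumption that is an inference rather than something stated in the text: the constants are matched by taking $a_i = a d_i^{-2}$ in (4.13). The shorthand $T = X'D^{-1}X/s$ and $c = g + h/s$ is introduced only for this display.

```latex
% Assumption (inferred, not stated in the source): a_i = a d_i^{-2}, so a_i d_i = a d_i^{-1}.
% Shorthand: T = X'D^{-1}X/s,  c = g + h/s,  r = T/(T + c) as in (4.16).
\[
\frac{a_i d_i\, r}{a_i d_i\, r + T}
  = \frac{a_i d_i}{a_i d_i + T + c}
  = \frac{a d_i^{-1}}{a d_i^{-1} + X'D^{-1}X/s + g + h/s},
\]
% which is the shrinkage factor in (4.15). With a_i = a d_i^{-2}:
\[
AD^2 = aI, \qquad A^2D^3 = a^2 D^{-1}, \qquad
\frac{2(m+2)^{-1}\bigl(\operatorname{tr}(AD^2) - 2\lambda_{\max}(AD^2)\bigr)}{\lambda_{\max}(A^2D^3)}
  = \frac{2(p-2)\,\min_i d_i}{(m+2)\,a}.
\]
% Since 0 <= r < 1 with sup r = 1, the bound (4.14) holds for all arguments
% exactly when the right-hand side is at least 1:
\[
a \le (\min_i d_i)\, 2(p-2)(m+2)^{-1}.
\]
```

Neither $g$ nor $h$ enters the final inequality, which is why no lower bound on $g$ is needed under this reading.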
5.  Necessary and Sufficient Conditions

In this section we treat the case of known variance (i.e.,
$X \sim N(\theta, D)$), and show that condition (4.1) is, in fact, necessary
and sufficient for minimaxity of the ridge estimator. The main
theorem of this section is the following.


Theorem 5.1: Let $X \sim N(\theta, D)$, $D = \operatorname{diag}(d_1, \ldots, d_p)$, and let the
ridge estimator $\delta^R(X)$ be given componentwise by

\[
\delta_i^R(X) = \left(1 - \frac{a_i d_i\, r(X'D^{-1}X)}{a_i d_i\, r(X'D^{-1}X) + X'D^{-1}X}\right) X_i, \qquad 1 \le i \le p,
\tag{5.1}
\]

where $a_i$ are positive constants and $r: \mathbb{R} \to [0,\infty)$ satisfies

i) $t\,r'(t) = o(1)$,
ii) $t^{3/2} r''(t) = o(1)$,
iii) $r(t)$ is bounded and non-decreasing,
iv) $r(t)/t$ is non-increasing.

Then $\delta^R(X)$ is minimax against the loss

\[
L(\delta^R(X),\theta) = (\delta^R(X)-\theta)'(\delta^R(X)-\theta)
\tag{5.2}
\]

if and only if

\[
0 \le r(t) \le \frac{2\bigl(\operatorname{tr}(AD^2) - 2\lambda_{\max}(AD^2)\bigr)}{\lambda_{\max}(A^2D^3)}
\tag{5.3}
\]

for all $t \ge 0$, where $A = \operatorname{diag}(a_1, \ldots, a_p)$.
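To make the componentwise form (5.1) concrete, the following is a minimal sketch of how the estimator could be evaluated for diagonal $D$. The function names and the particular choice of $r$ are hypothetical; `example_r` is just one function that is bounded, non-decreasing, and has $r(t)/t$ non-increasing.

```python
def ridge_estimate(x, a, d, r):
    """Componentwise ridge estimator of the form (5.1), known-variance case."""
    t = sum(xi * xi / di for xi, di in zip(x, d))  # t = X' D^{-1} X
    if t == 0.0:
        return [0.0 for _ in x]  # X = 0, so every component of the estimate is 0
    rt = r(t)
    estimate = []
    for xi, ai, di in zip(x, a, d):
        c = ai * di * rt                           # a_i d_i r(t)
        estimate.append((1.0 - c / (c + t)) * xi)  # shrink X_i toward 0
    return estimate


def example_r(t, c=0.5):
    """Hypothetical r: bounded by c, non-decreasing, with r(t)/t non-increasing."""
    return c * t / (t + 1.0)


est = ridge_estimate([3.0, -4.0], [1.0, 1.0], [1.0, 4.0], example_r)
```

Each component is multiplied by a factor in $(0, 1]$, and the factor is smaller (more shrinkage) where $a_i d_i$ is larger.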
Remark: Condition (i) is a slightly stronger requirement on the first
derivative of r than was previously needed, and is only needed for the
necessity of the theorem. The sufficiency of the theorem holds if
$t^{1/2} r'(t) = o(1)$. It should be noted, however, that the strengthening
of  this  condition  merely  eliminates  the  more  pathological  functions  from 
the  possible  choices  of  r. 


Proof: The sufficiency will follow from Theorem 4.1. Define $\delta^R(X,s)$
componentwise by

\[
\delta_i^R(X,s) = \left(1 - \frac{a_i d_i (m+2)^{-1} r(X'D^{-1}X/s)}{a_i d_i (m+2)^{-1} r(X'D^{-1}X/s) + X'D^{-1}X/s}\right) X_i, \qquad 1 \le i \le p,
\]

where $r$ satisfies conditions i) - iv) and $s \sim \chi^2_m$ independent of $X$.
From Theorem 4.1, $\delta^R(X,s)$ is minimax if

\[
0 \le r(t) \le \frac{2\bigl(\operatorname{tr}(AD^2) - 2\lambda_{\max}(AD^2)\bigr)}{\lambda_{\max}(A^2D^3)}, \qquad \forall\, t \ge 0.
\]

Since $\lim_{m\to\infty} s(m+2)^{-1} = 1$ a.e., it follows that $\lim_{m\to\infty} \delta^R(X,s) = \delta^R(X)$.
Also, from Lebesgue's Dominated Convergence Theorem it is easy to
check that

\[
\lim_{m\to\infty} R(\delta^R(X,s),\theta) = R(\delta^R(X),\theta),
\]

and hence the sufficiency is proved.

For the necessity, we first define $\Delta(\delta^R,\theta) = R(\delta^R(X),\theta) - R(X,\theta)$,
and from (5.1) and (5.2) we have

\[
\Delta(\delta^R,\theta) = \sum_{i=1}^{p} E\left\{ \frac{(a_i d_i r(t) X_i)^2}{(a_i d_i r(t)+t)^2} - \frac{2X_i(X_i-\theta_i)\,a_i d_i r(t)}{a_i d_i r(t)+t} \right\},
\]

where $t = X'D^{-1}X$. As in Theorem 3.1, we integrate the last term by
parts and rearrange terms to get

\[
\Delta(\delta^R,\theta) = \sum_{i=1}^{p} E\left\{ \frac{a_i d_i r(t)\,(a_i d_i r(t)+4)\,X_i^2}{(a_i d_i r(t)+t)^2} - \frac{2a_i d_i^2 r(t)}{a_i d_i r(t)+t} - \frac{4a_i d_i X_i^2\, t\, r'(t)}{(a_i d_i r(t)+t)^2} \right\}.
\]

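Now we apply Lemma 3.1, and noting that condition (i) ensures that
the term involving $r'(t)$ is $o(|\theta|^{-2})$, we have for sufficiently large $|\theta|$,

\[
\Delta(\delta^R,\theta) = \sum_{i=1}^{p} \left\{ \frac{a_i d_i r(\tau)\,(a_i d_i r(\tau)+4)\,\theta_i^2}{(a_i d_i r(\tau)+\tau)^2} - \frac{2a_i d_i^2 r(\tau)}{a_i d_i r(\tau)+\tau} \right\} + o(|\theta|^{-2}),
\]

where $\tau = \theta'D^{-1}\theta$. Now applying an argument similar to that used in
Theorem 3.1 in going from (3.18) to (3.20), we have for sufficiently
large $|\theta|$,

\[
\Delta(\delta^R,\theta) = \frac{r(\tau)}{\tau} \left\{ \frac{r(\tau)\,\theta'A^2D^2\theta + 4\theta'AD\theta}{\theta'D^{-1}\theta} - 2\operatorname{tr}(AD^2) \right\} + o(|\theta|^{-2}).
\tag{5.4}
\]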


Define a sequence of vectors $\theta_n^*$ as follows. Note that the matrices
$A^2D^3$ and $AD^2$ have common eigenvectors, and let $\alpha^*$ be the normed
eigenvector of $A^2D^3$ corresponding to its largest root. $\alpha^*$ is then also
the normed eigenvector of $AD^2$ corresponding to its largest root.


Define $\theta_n^*$ by

\[
\theta_n^* = n^{1/2} D^{1/2} \alpha^*.
\]


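Then $\theta_n^{*\prime} D^{-1} \theta_n^* = n$ and

\[
\frac{\theta_n^{*\prime} A^2D^2 \theta_n^*}{\theta_n^{*\prime} D^{-1} \theta_n^*}
= \alpha^{*\prime} D^{1/2} A^2D^2 D^{1/2} \alpha^*
= \alpha^{*\prime} A^2D^3 \alpha^*
= \lambda_{\max}(A^2D^3).
\]

Similarly, $\theta_n^{*\prime} AD\, \theta_n^* / \theta_n^{*\prime} D^{-1} \theta_n^* = \lambda_{\max}(AD^2)$. Thus (5.4) becomes, for
sufficiently large $n$,

\[
\begin{aligned}
\Delta(\delta^R(X),\theta_n^*) &= \frac{r(n)}{n}\Bigl( r(n)\lambda_{\max}(A^2D^3) + 4\lambda_{\max}(AD^2) - 2\operatorname{tr}(AD^2) \Bigr) + o(|\theta_n^*|^{-2}) \\
&= \frac{r(n)}{n}\,\lambda_{\max}(A^2D^3)\left[ r(n) - \frac{2\bigl(\operatorname{tr}(AD^2) - 2\lambda_{\max}(AD^2)\bigr)}{\lambda_{\max}(A^2D^3)} \right] + o(n^{-1}).
\end{aligned}
\]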


Now suppose (5.3) is violated, i.e., $\exists\, T > 0$ and $\varepsilon > 0$ such that
$\forall\, t > T$,

\[
r(t) > \frac{2\bigl(\operatorname{tr}(AD^2) - 2\lambda_{\max}(AD^2)\bigr)}{\lambda_{\max}(A^2D^3)} + \varepsilon > 0.
\tag{5.5}
\]

It then follows that for sufficiently large $n$

\[
\Delta(\delta^R(X),\theta_n^*) > r(n)\, n^{-1} \varepsilon\, \lambda_{\max}(A^2D^3) + o(n^{-1}),
\tag{5.6}
\]

and since (5.5) bounds $r(t)$ from below, for sufficiently large $n$ (5.6)
is positive and $\delta^R(X)$ is not minimax. Therefore, the contrapositive
and hence the theorem is proved.
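The sign argument at the end of the proof can be illustrated numerically. The sketch below evaluates only the leading term of the risk difference along the sequence $\theta_n^*$, ignoring the $o(n^{-1})$ remainder, under the sign convention in which a positive value means the ridge estimator has larger risk than $X$. The function name is hypothetical, and the largest diagonal entries of $AD^2$ and $A^2D^3$ are simply taken separately.

```python
def delta_leading_term(r_n, n, a, d):
    """Leading term (r(n)/n) * (r(n) lmax(A^2 D^3) + 4 lmax(A D^2) - 2 tr(A D^2)).

    The o(1/n) remainder of the expansion is deliberately ignored here.
    """
    ad2 = [ai * di ** 2 for ai, di in zip(a, d)]        # diagonal of A D^2
    a2d3 = [ai ** 2 * di ** 3 for ai, di in zip(a, d)]  # diagonal of A^2 D^3
    return (r_n / n) * (r_n * max(a2d3) + 4.0 * max(ad2) - 2.0 * sum(ad2))


# With A = D = I and p = 3 the bound (5.3) equals 2(p-2) = 2.
a, d, n = [1.0] * 3, [1.0] * 3, 1000
above = delta_leading_term(3.0, n, a, d)     # r above the bound
at_bound = delta_leading_term(2.0, n, a, d)  # r exactly at the bound
below = delta_leading_term(1.0, n, a, d)     # r below the bound
```

In this small case the leading term is positive only when $r$ exceeds the bound, matching the direction of the necessity argument.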

The proof of necessity in Theorem 5.1 did not require conditions (iii)
on $r(\cdot)$. We state this in the following corollary.

Corollary 5.1: Let $\delta^R(X)$ be the ridge estimator of (5.1) where
$r: \mathbb{R} \to [0,\infty)$ is bounded and satisfies

i) $t\,r'(t) = o(1)$,
ii) $t^{3/2} r''(t) = o(1)$.

If $\delta^R(X)$ is minimax against the loss (5.2), then

\[
\liminf_{t\to\infty} r(t) \le \frac{2\bigl(\operatorname{tr}(AD^2) - 2\lambda_{\max}(AD^2)\bigr)}{\lambda_{\max}(A^2D^3)}.
\]

Thisted  (1976)  derived  necessary  conditions  similar  to  the 
above  for  the  case  r(t)  = constant. 
6.  Minimaxity and Conditioning

The crucial condition for the minimaxity of $\delta^R(X)$ is that

\[
0 \le r(t) \le \frac{2\bigl(\operatorname{tr}(AD^2) - 2\lambda_{\max}(AD^2)\bigr)}{\lambda_{\max}(A^2D^3)},
\tag{6.1}
\]

and hence, it must necessarily be the case that

\[
\operatorname{tr}(AD^2) \ge 2\lambda_{\max}(AD^2).
\tag{6.2}
\]

We wish to point out the following inconsistency between the original goal
of ridge regression estimators and the performance of minimax ridge
regression estimators. Hoerl and Kennard saw ridge regression as a
solution to the "ill-conditioning" problem that was mentioned earlier,
which means, in particular, that the $a_i$'s should be chosen so that

\[
a_i \ge a_j \ \text{ when } \ d_i \ge d_j, \qquad 1 \le i, j \le p,
\tag{6.3}
\]

which will lower the condition number of the matrix inverted in the
regression situation, and lead to what Hoerl and Kennard refer to as
a more "stable" estimator.

Choosing the $a_i$'s to satisfy (6.3) is also intuitively appealing
for two reasons. One, it is Bayesian in nature, and two, it is sensible
to add only small amounts of bias to directions with good information
(small $d_i$'s). An inconsistency arises, however, when the condition
of minimaxity is forced into the estimator. If the $d_i$'s are very spread
out (as will occur in an ill-conditioned problem), the matrix $D$ is
likely to satisfy

\[
\operatorname{tr}(D^2) \le 2\lambda_{\max}(D^2).
\tag{6.4}
\]

As the number of dimensions, $p$, increases, it is more likely that the
inequality in (6.4) will reverse, but in general one would expect (6.4)
to be the case. If the ridge estimator is to be minimax, (6.2) must
hold so the $a_i$'s must be chosen to "reverse" the inequality in (6.4),
and this cannot be done if the $a_i$'s satisfy (6.3).

The result is an incompatibility between minimaxity and the
conditioning problem. Most minimax estimators will have the constants
$a_i$ satisfying

\[
a_i \ge a_j \ \text{ when } \ d_i \le d_j, \qquad 1 \le i, j \le p,
\tag{6.5}
\]

(see, e.g., Strawderman (1976)). Choosing the $a_i$'s to satisfy (6.5),
however, is not only intuitively unappealing but, in many cases, will
aggravate the conditioning problem. The solution seems to lie in a
compromise between the two criteria, possibly resulting in an estimator
with bounded risk which will improve the conditioning problem. This
idea is developed more fully in Casella (1977).
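A small numerical illustration of this incompatibility follows. It is a sketch with made-up values: the spread-out $d_i$ and the two choices of $a_i$ are hypothetical, chosen only so that one satisfies the ordering in (6.3) and the other the ordering in (6.5).

```python
def satisfies_6_2(a, d):
    """Check tr(A D^2) >= 2 * lmax(A D^2) for diagonal A = diag(a), D = diag(d)."""
    ad2 = [ai * di ** 2 for ai, di in zip(a, d)]  # diagonal of A D^2
    return sum(ad2) >= 2.0 * max(ad2)


d = [1.0, 10.0, 100.0]                 # very spread out: an ill-conditioned problem
constant_a = [1.0, 1.0, 1.0]           # A = I, so (6.2) is the reverse of (6.4)
increasing_a = list(d)                 # a_i = d_i, ordered as in (6.3)
decreasing_a = [di ** -2 for di in d]  # a_i = d_i^{-2}, ordered as in (6.5)

results = {
    "A = I": satisfies_6_2(constant_a, d),
    "a_i increasing in d_i": satisfies_6_2(increasing_a, d),
    "a_i decreasing in d_i": satisfies_6_2(decreasing_a, d),
}
```

For these values only the choice ordered as in (6.5) satisfies the necessary condition (6.2); making $a_i$ increase with $d_i$ spreads the diagonal of $AD^2$ out even further.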
APPENDIX:  COMPUTATIONAL LEMMAS


Let $X$ have a $p$-variate normal distribution with mean $\theta$ and
covariance matrix $D$. Let $\chi^2_p(\lambda)$ denote a chi-square random variable
with $p$ degrees of freedom and non-centrality parameter $\lambda/2$.


Lemma 0: If $K \sim \text{Poisson}(\lambda/2)$ and $Z \mid K \sim \chi^2_{p+2K}$, then $Z \sim \chi^2_p(\lambda)$.
In particular, if $E[h(\chi^2_p(\lambda))]$ exists,

\[
E[h(\chi^2_p(\lambda))] = E_K\bigl\{ E[h(\chi^2_{p+2K}) \mid K] \bigr\}.
\]


Proof: This is a relatively well-known result, stated here simply
for completeness (see, e.g., James and Stein (1961)).
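The mixture representation in Lemma 0 lends itself to a direct numerical check. The following sketch, with a hypothetical helper name, truncates the Poisson sum and takes the central chi-square expectation $E[h(\chi^2_q)]$ as a user-supplied function of the degrees of freedom $q$.

```python
import math


def poisson_mixture(p, lam, central_value, terms=200):
    """Approximate E[h(chi2_p(lam))] as sum_k P(K=k) E[h(chi2_{p+2k})], K ~ Poisson(lam/2).

    central_value(q) must return E[h(chi2_q)] for a central chi-square with q d.f.
    The infinite sum is truncated after `terms` terms.
    """
    weight = math.exp(-lam / 2.0)  # P(K = 0)
    total = 0.0
    for k in range(terms):
        total += weight * central_value(p + 2 * k)
        weight *= (lam / 2.0) / (k + 1)  # P(K = k+1) obtained from P(K = k)
    return total


# h(y) = y:   E[chi2_q] = q,         so the mixture should give p + lam.
# h(y) = y^2: E[chi2_q^2] = q(q+2),  so the mixture should give 2(p + 2 lam) + (p + lam)^2.
mean = poisson_mixture(5, 3.0, lambda q: q)
second_moment = poisson_mixture(5, 3.0, lambda q: q * (q + 2))
```

Both values match the usual moments of a non-central chi-square whose mean is $p + \lambda$, which is the parameterization implied by the Poisson$(\lambda/2)$ mixing above.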

The next five lemmas are from Bock (1975), and are stated
without proof.


Lemma 1: Let $h: [0,\infty) \to (-\infty,\infty)$. Then

\[
E\{h(X'D^{-1}X)\,X\} = \theta\, E\{h(\chi^2_{p+2}(\theta'D^{-1}\theta))\}.
\]

Lemma 2: If $D = \operatorname{diag}(d_1, \ldots, d_p)$, and $h: [0,\infty) \to (-\infty,\infty)$, then

\[
E\{h(X'D^{-1}X)\,X_i^2\} = d_i\, E\{h(\chi^2_{p+2}(\theta'D^{-1}\theta))\} + \theta_i^2\, E\{h(\chi^2_{p+4}(\theta'D^{-1}\theta))\}.
\]

Lemma 3: Let $W_{p \times p}$ be symmetric positive definite, and let
$h: [0,\infty) \to (-\infty,\infty)$. Then

\[
E\{h(X'D^{-1}X)\,X'WX\} = \operatorname{tr}(WD)\, E\{h(\chi^2_{p+2}(\theta'D^{-1}\theta))\} + \theta'W\theta\, E\{h(\chi^2_{p+4}(\theta'D^{-1}\theta))\}.
\]


Lemma 4: Let $h: [0,\infty) \to (-\infty,\infty)$. Then, if the expected values on
both sides exist,

\[
E\{h(\chi^2_p)\} = E\left\{ \frac{p\, h(\chi^2_{p+2})}{\chi^2_{p+2}} \right\}.
\]

Lemma 5: Let $S: [0,\infty) \to [0,\infty)$ and $t: [0,\infty) \to [0,\infty)$ be monotone
non-decreasing and non-increasing functions, respectively. Let $W$
be a non-negative random variable. Assume $E(W)$, $E(S(W))$,
$E(WS(W))$, $E(t(W))$ and $E(Wt(W))$ exist and are finite. Then

\[
E\{S(W)(E(W)-W)\} \le 0 \le E\{t(W)(E(W)-W)\}.
\]
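The two-sided inequality of Lemma 5 can be checked on a small discrete example. This is only an illustrative sketch: the distribution of $W$ and the functions $S$ and $t$ below are hypothetical choices satisfying the stated monotonicity assumptions.

```python
def lemma5_sides(values, probs, S, t):
    """Return (E{S(W)(E(W) - W)}, E{t(W)(E(W) - W)}) for a discrete W."""
    mean_w = sum(p * w for w, p in zip(values, probs))
    left = sum(p * S(w) * (mean_w - w) for w, p in zip(values, probs))
    right = sum(p * t(w) * (mean_w - w) for w, p in zip(values, probs))
    return left, right


# W uniform on {0, 1, ..., 9}; S is non-decreasing, t is non-increasing.
values = list(range(10))
probs = [0.1] * 10
left, right = lemma5_sides(
    values,
    probs,
    S=lambda w: w * w,
    t=lambda w: 1.0 / (1.0 + w),
)
```

Here `left` is negative and `right` is positive, as the lemma asserts; both quantities are minus the covariance of $W$ with the respective function.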

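Lemma 6: Let $h: [0,\infty) \to (-\infty,\infty)$. If $E\{h(\chi^2_p(\theta'\theta))\}$ exists, then

\[
\frac{\partial}{\partial\theta_i} E\{h(\chi^2_p(\theta'\theta))\} = \theta_i \bigl[ E\{h(\chi^2_{p+2}(\theta'\theta))\} - E\{h(\chi^2_p(\theta'\theta))\} \bigr]
\]

for $1 \le i \le p$.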


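Proof:

\[
E\{h(\chi^2_p(\theta'\theta))\} = \int_0^\infty \sum_{k=0}^\infty h(y) \left(\frac{\theta'\theta}{2}\right)^{k} \frac{e^{-\theta'\theta/2}}{k!}\, C_{p+2k}\, y^{(p+2k)/2-1} e^{-y/2}\, dy,
\]

where $C_{p+2k} = \bigl(\Gamma(\tfrac{p+2k}{2})\, 2^{(p+2k)/2}\bigr)^{-1}$. Interchanging the order of summation
and integration yields

\[
\begin{aligned}
E\{h(\chi^2_p(\theta'\theta))\} &= \sum_{k=0}^\infty \left(\frac{\theta'\theta}{2}\right)^{k} \frac{e^{-\theta'\theta/2}}{k!}\, E\,h(\chi^2_{p+2k}) \\
&= \sum_{k=0}^\infty e^{-\theta'\theta/2}\, e^{k\log\theta'\theta}\, \frac{e^{-k\log 2}}{k!}\, E\,h(\chi^2_{p+2k}).
\end{aligned}
\]

From Lehmann (1959), Theorem 9, page 52, we can differentiate the
above expression, with respect to $\log\theta'\theta$, inside the summation.
Thus,

\[
\begin{aligned}
\frac{\partial}{\partial\log\theta'\theta} E\{h(\chi^2_p(\theta'\theta))\}
&= \sum_{k=0}^\infty \frac{\partial}{\partial\log\theta'\theta}\Bigl( e^{-\theta'\theta/2}\, e^{k\log\theta'\theta} \Bigr) \frac{e^{-k\log 2}}{k!}\, E\,h(\chi^2_{p+2k}) \\
&= \sum_{k=0}^\infty e^{-\theta'\theta/2}\, e^{k\log\theta'\theta} \left( k - \frac{1}{2}\,\frac{\partial\,\theta'\theta}{\partial\log\theta'\theta} \right) \frac{e^{-k\log 2}}{k!}\, E\,h(\chi^2_{p+2k}).
\end{aligned}
\]

Since

\[
\frac{\partial\,\theta'\theta}{\partial\log\theta'\theta} = \theta'\theta,
\]

rearranging terms yields

\[
\begin{aligned}
\frac{\partial}{\partial\log\theta'\theta} E\{h(\chi^2_p(\theta'\theta))\}
&= \sum_{k=0}^\infty \frac{e^{-\theta'\theta/2}}{k!} \left(\frac{\theta'\theta}{2}\right)^{k} \left( k - \frac{\theta'\theta}{2} \right) E\,h(\chi^2_{p+2k}) \\
&= \frac{\theta'\theta}{2} \left[ \sum_{j=0}^\infty \frac{e^{-\theta'\theta/2}}{j!} \left(\frac{\theta'\theta}{2}\right)^{j} E\,h(\chi^2_{p+2+2j}) - \sum_{k=0}^\infty \frac{e^{-\theta'\theta/2}}{k!} \left(\frac{\theta'\theta}{2}\right)^{k} E\,h(\chi^2_{p+2k}) \right] \\
&= \frac{\theta'\theta}{2} \bigl[ E\,h(\chi^2_{p+2}(\theta'\theta)) - E\,h(\chi^2_p(\theta'\theta)) \bigr].
\end{aligned}
\]

From the chain rule,

\[
\frac{\partial}{\partial\theta_i} E\{h(\chi^2_p(\theta'\theta))\} = \left( \frac{\partial\log\theta'\theta}{\partial\theta_i} \right) \frac{\partial}{\partial\log\theta'\theta} E\{h(\chi^2_p(\theta'\theta))\},
\]

and since

\[
\frac{\partial\log\theta'\theta}{\partial\theta_i} = \frac{2\theta_i}{\theta'\theta}, \qquad 1 \le i \le p,
\]

the result is proved.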
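The derivative identity just proved can be sanity-checked by comparing a finite-difference derivative with the right-hand side. The sketch below is illustrative only: the helper names are hypothetical, the Poisson sum is truncated, and the test function $h(y) = y^2$ is chosen because $E[(\chi^2_q)^2] = q(q+2)$ is available in closed form.

```python
import math


def mixture(q, lam, central_value, terms=200):
    """E[h(chi2_q(lam))] via the Poisson(lam/2) mixture of central chi-squares."""
    weight = math.exp(-lam / 2.0)  # P(K = 0)
    total = 0.0
    for k in range(terms):
        total += weight * central_value(q + 2 * k)
        weight *= (lam / 2.0) / (k + 1)  # P(K = k+1) obtained from P(K = k)
    return total


def identity_rhs(theta, i, central_value):
    """theta_i * (E h(chi2_{p+2}(theta'theta)) - E h(chi2_p(theta'theta)))."""
    p = len(theta)
    lam = sum(th * th for th in theta)
    return theta[i] * (mixture(p + 2, lam, central_value) - mixture(p, lam, central_value))


def identity_lhs(theta, i, central_value, eps=1e-5):
    """Central finite-difference approximation of d/d theta_i E h(chi2_p(theta'theta))."""
    p = len(theta)
    up, down = list(theta), list(theta)
    up[i] += eps
    down[i] -= eps
    value = lambda th: mixture(p, sum(x * x for x in th), central_value)
    return (value(up) - value(down)) / (2.0 * eps)


theta = [1.0, -0.5, 2.0]
second_moment = lambda q: q * (q + 2)  # E[(chi2_q)^2] for a central chi-square
lhs = identity_lhs(theta, 0, second_moment)
rhs = identity_rhs(theta, 0, second_moment)
```

For this choice the two sides agree to within finite-difference error, and both equal $2\theta_i\,(4 + 2(p + \theta'\theta))$, which can be derived by hand from the second moment of a non-central chi-square.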


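Lemma 8: Let $D = \operatorname{diag}(d_1, \ldots, d_p)$. If $E\{h(\chi^2_p(\theta'D^{-1}\theta))\}$ exists,
then

\[
\frac{\partial}{\partial\theta_i} E\{h(\chi^2_p(\theta'D^{-1}\theta))\} = \frac{\theta_i}{d_i} \bigl[ E\{h(\chi^2_{p+2}(\theta'D^{-1}\theta))\} - E\{h(\chi^2_p(\theta'D^{-1}\theta))\} \bigr],
\]

for $1 \le i \le p$.


Proof: Similar to that of Lemma 7.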
Lemma 9: Let $p \ge 3$ and $r: \mathbb{R} \to [0,\infty)$ satisfy

i) $r(t)$ is non-decreasing,

ii) $r(t)/t$ is non-increasing.

Let $v = \chi^2_{p+4}(\theta'\theta)/\chi^2_m$, where $\chi^2_{p+4}(\theta'\theta)$ and $\chi^2_m$ are independent. The
function

\[
f(a) = E\left\{ \frac{a\,r(v)\,(a\,r(v)\,m + 4)}{(a\,r(v)\,\chi^2_m + \chi^2_{p+4}(\theta'\theta))^2} \right\}
\]

is strictly increasing in $a$ if either $0 \le a\,r(t) \le 2(p-2)/m$,
$\forall\, t \ge 0$, or $p \ge 4$.

Proof:  By  an  argument  similar  to  that  in  Lemma  7 we  can  differentiate 
inside  the  expectation,  and  after  some  algebra  we  obtain 


JLf(a),El  Lrl»).Ur(v)-*«) 


1(0*0)  — 


2ar( v) 


(>P+4V0  n>-  aVrCvT 2 


)}. 


Adding $\pm\, 2amr(v)(amr(v)+2)^{-1}$ inside the parentheses yields


3 fi-n  - rr  2r(v)(ar(v)m+4)  ,2  (ili,  2amr(v) 

" <-<v)4**>'e))3  9!  • 


+ E 


4ar  (v) 


(ar(v)*m+Vr(o'e)) 


7 ( m- x ) > • 

3 ' Anr 


From condition ii, the definition of v, and Lemma 5 it follows that the
second expectation above is non-negative. Now from Lemma 1, the first
expectation  is  equal  to 
EKE


2r(w)  (ar(w)m+4) 

(ar  (w)  x. 


yar  ywym-rt; 2amr(w)ni/1  /,, 

2,  2 *3  xp+r+2K  1 K } * (1) 


m xp+4+2K 


2 2 

where $K \sim \text{Poisson}(\theta'\theta/2)$ and $w = \chi^2_{p+4+2K}/\chi^2_m$. Now applying Lemma 4 three
times shows that (1) is equal to


s(K)r(u)(ar(u)m+4)  ? - 

— 2 2 (xn_9j .ov^ 


(ar(u)xm+xp-2+2K) 


,3  ' xp-2+2K' 


(2) 


X<V2*2K  -5S?^2>IK»- 


where $s(K) = 2(p+2+2K)^{-1}(p+2K)^{-1}(p-2+2K)^{-1} > 0$, and $u = \chi^2_{p-2+2K}/\chi^2_m$.
Define


\[
\psi(\chi^2_{p-2+2K}, \chi^2_m) = \frac{s(K)\,r(u)\,(a\,r(u)\,m + 4)\,(\chi^2_{p-2+2K})^3}{(a\,r(u)\,\chi^2_m + \chi^2_{p-2+2K})^3},
\]

which is non-decreasing in $\chi^2_{p-2+2K}$ from the conditions on $r$. Adding
$\pm(p-2+2K)$ inside the parentheses shows that (2) is equal to


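\[
E_K E\bigl\{ \psi(\chi^2_{p-2+2K}, \chi^2_m)\,\bigl(\chi^2_{p-2+2K} - (p-2+2K)\bigr) \bigr\}
+ E_K E\left\{ \psi(\chi^2_{p-2+2K}, \chi^2_m) \left( p-2+2K - \frac{2amr(u)}{amr(u)+2} \right) \right\}.
\tag{3}
\]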


The first expectation is non-negative from Lemma 5, and if $p \ge 4$, the
second expectation is strictly positive since

\[
p-2+2K \ge p-2 \ge 2 > 2amr(u)(amr(u)+2)^{-1}.
\]

If $p = 3$, since $0 \le a\,r(t) \le 2(p-2)/m$, the only concern is if $a\,r(t_0) = 2(p-2)/m = 2/m$
for some $t_0$. But then it follows from condition
(i) that $a\,r(t) = 2/m$, $\forall\, t \ge t_0$, and a simple argument will show
that the first expectation in (3) is positive. Hence the derivative of
$f(a)$ is always positive so $f(a)$ is strictly increasing. ||